116 ◾ Bioinformatics
bcftools mpileup -Ou \
-f reference_file \
-b bam_list_file \
| bcftools call -vmO z \
-o sarscov2.vcf.gz
The “bcftools mpileup” is the command used for the pileup. The “-f” option specifies the
reference file used by the aligner to produce the BAM file, “-b” option specifies the BAM
file from which we wish to call variants. The above form allows both pileup and variant
calling.
In the following, we will discuss an example of variant calling pipeline using “bcftools”.
We will use SARS-CoV-2 raw sequence data from the NCBI SRA database to demonstrate
variant calling.
SARS-CoV-2 is the virus that causes COVID-19 which affected millions of people
around the world and caused thousands of deaths. The virus mutates in a short period of
time, producing new strains. The following are the NCBI SRA run unique identifiers of raw
data generated by whole genome sequencing of SARS-CoV-2:
SRR16989486
SRR16537313
SRR16537315
SRR16537317
In the following, we wish to identify variants (SNVs and InDels) from short reads of those
four samples and report that in a VCF file.
The programs used with this example include SRA toolkits, wget, gzip, samtools, bwa,
and bcftool on Linux.
The workflow for variant discovery includes: (i) acquiring the raw data (FASTQ files), (ii)
quality control if required, (iii) downloading and indexing the reference genome, (iv) using
an aligner to map reads in FASTQ files to a reference genome to generate a BAM file, (v)
sorting the aligned reads in BAM files, (vi) removal of duplicate reads from the BAM files,
and (vii) performing variant calling and generation of a single VCF file for genotyping of all
samples. Any new generated BAM files must be indexed before proceeding to the next step.
In the following, we will discuss each of these steps. First, we need to open the Linux ter-
minal, create a directory called “sarscov2” for this project, and make it the current working
directory.
mkdir sarscov2
cd sarscov2
4.2.1.1.1 Acquiring the raw data
The raw data files are stored in the NCBI SRA database with above SRA run IDs. The
FASTQ files are in sequence read archive (SRA) format that requires SRA toolkits to
be downloaded and extracted. For that purpose, we can use “fasterq-dump” with the
following syntax: